library(readtext)
library(stringr)
library(tidyr)
library(tidytext)
library(ggplot2)
library(forcats)
library(dplyr)
library(ggraph)
library(magrittr)
library(topicmodels)
library(widyr)
Text mining aims at extracting information from a different type of input data: free text without any tabular structure. To achieve that, two steps must be accomplished: the first is to vectorize the input and the second is to investigate common patterns in the vectorized data. The process behind this “understanding” is called Natural Language Processing (NLP) and combines methodologies from linguistics and computer science to complete tasks such as speech-to-text conversion, optical character recognition, word segmentation or tokenization, lemmatization and stemming, sentence breaking, terminology extraction, sentiment analysis, relationship semantics, topic recognition, argument recognition and, at a more advanced level, text summarization, natural language generation and question answering.
Our goals here will be limited to text tokenization and distributional semantics with semantic parsing: we will break up the text into tabular data we can analyze, and we will try to find and graphically represent word relationships based on the corpus of our data.
Corpus: an unstructured collection of texts, irrespective of language.
Along the way, terminology will be explained as it occurs. The purpose of the lesson is to give an introduction to the workflow of NLP and to its difference from numeric statistical methods - we will calculate statistics, but our main goal is to recognize patterns.
The corpus we will use consists of 71 unstructured breast ultrasound reports from one radiologist. They are stored in individual .txt files in UTF-8 encoding and they are all written in Greek. There are two pieces of information written in English, the ACR and the BIRADS classification. The texts may contain dates, but we will not attempt to recognize them within context. The same applies to dimension measurements: we will not extract them as usable data at this point; they will be treated as words.
To read data written in Greek into R, we first need to set the system parameters so that the Greek alphabet is recognized correctly. The function controlling these features is Sys.setlocale(category, locale); here the category argument is set to "LC_ALL" and the locale argument to "Greek".
#library(readtext)
Sys.setlocale(category = "LC_ALL", locale = "Greek")
## [1] "LC_COLLATE=Greek_Greece.1253;LC_CTYPE=Greek_Greece.1253;LC_MONETARY=Greek_Greece.1253;LC_NUMERIC=C;LC_TIME=Greek_Greece.1253"
setwd("~/AUTh.MSc/IntroDataAnalytics/semi structured text mining")
data <- readtext(getwd(), encoding = "UTF-8")
# De-capitalize all letters in the texts
#library(stringr)
data[,2] <- str_to_lower(data[,2], locale = "greek")
Having imported the data and converted all texts to lower case, we are ready for tokenization.
Token: a building block of a natural language; a word or a phrase or a whole sentence.
What type of object are we working with? Which sets of functions can we use to tokenize the text?
# Inspect the type of elements in the data object
str(data[1,2])
## chr "δεν προσκομίσθηκε προηγούμενη μαστογραφία –και γενικά προηγούμενες εξετάσεις \n\nιστορικό όγκεκτομής στο άνω "| __truncated__
Our data consists of character objects. An interesting feature arises when we inspect them: the metacharacter \n. This is the newline metacharacter; it marks the points where lines broke in the original file, now that the lines of each text have been unified into a single string.
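A minimal base-R sketch of what \n does (toy string, not taken from the corpus):

```r
# A string with embedded newlines can be split back into its original lines;
# empty strings mark the blank lines of the prototype file
txt <- "first line\n\nsecond line"   # toy example
strsplit(txt, "\n")[[1]]
```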
To facilitate the process, let us extract the first of the 71 texts and work with it before implementing the process to the entire corpus.
# Read one text file
m1 <- readLines("~/AUTh.MSc/IntroDataAnalytics/semi structured text mining/m01.txt",
encoding = "UTF-8")
Now, the first step is to remove the empty lines left behind by the metacharacter, to begin constructing a tabular form of the text. Additional characters that should be removed in this step are trailing spaces.
Trailing space: a redundant white space at the beginning or end of a line.
## Remove empty lines
m1n <- m1[sapply(m1, nchar) > 0]
# Remove trailing spaces
m1n %<>%
str_trim()
# convert into tidy format
mu <- tibble(lines = 1:length(m1n), text = m1n)
glimpse(mu)
## Rows: 9
## Columns: 2
## $ lines <int> 1, 2, 3, 4, 5, 6, 7, 8, 9
## $ text <chr> "Δεν προσκομίσθηκε προηγούμενη μαστογραφία –και γενικά προηγούμε~
class(mu)
## [1] "tbl_df" "tbl" "data.frame"
There, we have a closer approximation to rectangular data from which to start extracting information. We can now use functions from the {tidytext} package, which take data frames as input and provide a variety of ways to break down their elements.
Tabulation: counting the occurrences of an instance systematically
unnest_tokens() is a function that takes a data frame, splits one of its columns into tokens and returns the result in a one-token-per-row format. The first argument is the table (here it is passed through the pipe) and the following two are the output and input arguments: the output must be named and the input must be present in the table. Both are written as variables, not as character strings. Having completed this tabulation, the final step is counting the occurrences of each instance in the entire text, irrespective of line position. This is accomplished with count() from {dplyr}.
# and create a dataframe with two columns: line position and word
library(tidytext)
mu1 <- mu %>%
unnest_tokens(word, text) #%>%
# a nested dataset with lists for each line, one column per word
#pivot_wider(names_from = lines, values_from = sentence)
# beware of the difference in the elements created
class(mu1[1,])
## [1] "tbl_df" "tbl" "data.frame"
class(mu1[[1]])
## [1] "integer"
# tabulate the words found in text
mu1 %<>%
count(word, sort = T)
### make a plot for the terms in text
mu1 %>%
# within the aesthetics, reorder the occurrences to facilitate interpretation
ggplot(aes(y = fct_reorder(word, n), x = n)) +
geom_bar(stat = "identity", fill = "steelblue")
It is evident that some words appear more often than others; we could exclude them in various ways:
Stop word: a commonly used word that serves syntactic and grammatical purposes rather than carrying meaning
Stemming: removing suffixes
Lemmatization: isolating the lemma; the standard form of the word with no suffix.
The use of regular expressions to clear up a document is based on the use of characters and rules. Indexing and conditionals are heavily used and the main functions are grep*() and str_*().
| Indices for character classes | Used |
|---|---|
| [[:digit:]] | to isolate digits |
| [[:lower:]] | to choose lower-case letters |
| [[:alnum:]] | to choose alphanumeric characters |
| [[:space:]] | to choose spaces (also tabs, newlines, returns) |
| [[:punct:]] | to choose punctuation |
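A brief base-R sketch of these character classes in action (the strings are toy examples):

```r
# POSIX character classes with base-R pattern functions
x <- c("BIRADS 3", "acr - b", "no digits here")
grepl("[[:digit:]]", x)                          # which strings contain a digit
gsub("[[:punct:]]", "", "acr - b")               # strip punctuation
gsub("[[:space:]]+", " ", "too   many spaces")   # collapse runs of whitespace
```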
Pattern detection and location is the process during which a document is scanned for a predefined set of characters (a pattern) and its metadata (i.e. first encounter, number of repetitions, position, length) is recorded.
# define the pattern under investigation
ptrn <- "ACR"
# and detect it
lapply(m1, function(x){str_locate(x, ptrn)})
Can we extract the information on ACR? First we should clean up the string (line 13) where our pattern is found. This is accomplished with regular expressions: \\s* identifies white space (note the * quantifier; human typing is inconsistent, so we allow zero or more spaces as a precaution, with the aim of applying this to the whole corpus), [[:punct:]] matches the punctuation used in the text, and the final expression (\\w+) captures the word characters following the pattern:
str_match(m1[13], "ACR\\s*[[:punct:]]\\s*(\\w+)")
## [,1] [,2]
## [1,] "ACR- Α" "Α"
Excellent! Our data is stored in the second column of the resulting matrix. We can save it and then pull the information for further analysis over the corpus.
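A sketch of how that saving step might look, using base R's regexec() as an equivalent of str_match(); the vector m1 below is a toy stand-in for the report lines, not real corpus data:

```r
# Capture the ACR value (group 2 of the match) and keep it for later use
m1 <- c("some narrative text", "ACR- A")     # toy stand-in for report lines
hits <- regmatches(m1, regexec("ACR\\s*[[:punct:]]\\s*(\\w+)", m1, perl = TRUE))
acr <- vapply(hits, function(h) if (length(h)) h[2] else NA_character_, "")
acr[!is.na(acr)]                             # the extracted classification
```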
In the previous step we detected a list of words that do not carry any contextual meaning. To investigate associations between words, i.e. phrases, it is advisable to remove these words from our object.
The steps include tokenization into three-word tokens, splitting them into columns to apply a filter for the stop words and then count the frequencies in the remaining document:
stpw <- tibble(word = c("της", "του", "με", "και",
"το", "την", "στην", "ως",
"να", "η", "τα", "στον",
"σε", "που", "στο"))
mmm <- mu %>%
unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
separate(trigram, c("w1", "w2", "w3"), sep = " ") %>% # split trigrams to columns
filter(!w1 %in% stpw$word, # remove all obsolete words
!w2 %in% stpw$word,
!w3 %in% stpw$word) %>%
count(w1, w2, w3, sort = T) # count
set.seed(2021)
a <- grid::arrow(type = "open", length = unit(.1, "inches"))
mmm %>%
igraph::graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(arrow = a) +
geom_node_point(colour = "steelblue") +
geom_node_text(aes(label = name), vjust = 1, hjust = 1)
The clustering of terms gives some insight into how words are used within the text. The graph is mostly useful for inspection and sharing; the information an algorithm can use is contained in the first step of the pipeline (igraph::graph_from_data_frame()), which creates lists describing the relationships between the words. The researcher can, however, evaluate the use of the negation word “δεν” along with other words, and the BIRADS and ACR classifications. For the latter, we can extract this information in a similar way as before and save it for analysis.
# Extract the ACR classification
## unite the trigram data
mmmF <- mmm %>%
unite(trigram, w1, w2, w3, sep = " ")
## look up the pattern
sapply(mmmF, function(x){str_extract(x, "acr")})
## trigram n
## [1,] NA NA
## [2,] NA NA
## [3,] "acr" NA
## [4,] NA NA
## [5,] NA NA
## [6,] NA NA
## [7,] NA NA
## [8,] NA NA
## [9,] NA NA
## [10,] NA NA
## [11,] NA NA
## [12,] NA NA
## [13,] NA NA
## [14,] NA NA
## [15,] NA NA
## [16,] NA NA
## [17,] NA NA
## [18,] NA NA
## [19,] NA NA
## [20,] NA NA
## [21,] NA NA
## [22,] NA NA
## [23,] NA NA
## [24,] NA NA
## [25,] NA NA
## [26,] NA NA
## [27,] NA NA
## and extract the information
str_match(mmmF[3,1], "acr\\s*(\\w+)")
## [,1] [,2]
## [1,] "acr α" "α"
## repeat for BIRADS
sapply(mmmF, function(x){str_extract(x, "birads")})
## trigram n
## [1,] NA NA
## [2,] NA NA
## [3,] NA NA
## [4,] "birads" NA
## [5,] NA NA
## [6,] NA NA
## [7,] NA NA
## [8,] NA NA
## [9,] NA NA
## [10,] NA NA
## [11,] NA NA
## [12,] NA NA
## [13,] NA NA
## [14,] NA NA
## [15,] NA NA
## [16,] NA NA
## [17,] NA NA
## [18,] NA NA
## [19,] NA NA
## [20,] NA NA
## [21,] NA NA
## [22,] NA NA
## [23,] NA NA
## [24,] NA NA
## [25,] NA NA
## [26,] NA NA
## [27,] NA NA
str_match(mmmF[4,1], "birads\\s*(\\w+)")
## [,1] [,2]
## [1,] "birads 0" "0"
A single text is not as informative as a corpus of texts. In our case study, the texts are on the same topic; text mining extends to corpora with different topics. How can we identify the topic of the texts in either case? A suggestion would be to measure the weight of the words (or tokens) that make up the text.
Term frequency: how often is a word used in the corpus?
Of course, some words are repeated very often simply to form meaningful natural-language sentences: the stop words we encountered in the single-text analysis.
Zipf’s law states that the frequency of any word in the corpus is inversely proportional to its rank in the frequency table
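A toy numeric illustration of the law, using made-up frequencies that follow it exactly (not corpus data):

```r
# Under Zipf's law, frequency is proportional to 1/rank, so
# log10(frequency) plotted against log10(rank) is a line with slope -1
rank <- 1:100
freq <- 1000 / rank                # idealized Zipfian frequencies
fit  <- lm(log10(freq) ~ log10(rank))
coef(fit)[["log10(rank)"]]         # slope of -1 for a perfect Zipf distribution
```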
# construct a custom function to remove empty lines from texts
rmeL <- function(x) {
x[sapply(x, nchar) > 0] %>% # removing empty lines
str_trim()
}
# apply the above transformations to the entire body of evidence
dataC <- lapply(data, rmeL)
dataC <- as_tibble(dataC)
# count word appearances in each diagnostic text
diagw <- dataC %>%
unnest_tokens(word, text) %>% # split words
count(doc_id, word, sort = T) # count
# count all words in each diagnostic text
allw <- diagw %>%
group_by(doc_id) %>%
summarize(total = sum(n))
# combine the two datasets
plt <- left_join(diagw, allw)
# Zipf's law
rfrq <- plt %>%
group_by(doc_id) %>%
# calculate each word's rank and term frequency
mutate(rank = row_number(), `term frequency` = n/total) %>%
ungroup()
rfrq %>%
ggplot(aes(rank, `term frequency`, color = doc_id)) +
geom_line(size = .5, alpha = .25, show.legend = F) +
scale_x_log10() +
scale_y_log10() +
scale_color_viridis_d()
# To add Zipf slope
# filter the most common rankings
frw <- rfrq %>%
filter(rank < 500,
rank > 10)
# calculate a slope with the characteristics of the inverse frequency
lm(log10(`term frequency`) ~ log10(rank), data = frw)
##
## Call:
## lm(formula = log10(`term frequency`) ~ log10(rank), data = frw)
##
## Coefficients:
## (Intercept) log10(rank)
## -1.712 -0.251
# redraw the plot
rfrq %>%
ggplot(aes(rank, `term frequency`, color = doc_id)) +
geom_line(size = .5, alpha = .25, show.legend = F) +
scale_x_log10() +
scale_y_log10() +
scale_color_viridis_d() +
geom_abline(intercept = -1.712, slope = -0.251, color = "blue")
We may consider clearing the dataset before we start applying the processes above, to quantify the occurrence of words (or their stems), their relationships and their use in context with neighboring words in the phrase.
## custom stop words
# count all words and remove those not needed
dataC %>%
unnest_tokens(word, text) %>%
count(word, sort = T)
cstpwrds <- tibble(word = c("με", "η", "το", "και", "της", "του", "των",
"ή", "σε", "στο", "που", "στην", "όσον", "να",
"ως", "τους", "τα", "στις", "στον", "κατά",
"βάσει", "οι", "οποίας", "την", "από", "στη",
"στα", "έχει", "επί", "για", "οποίων", "αυτών"))
# remove the obsolete words from the counts
diago <- anti_join(diagw, cstpwrds, by = "word")
# word count without the stop words
tbo <- diago %>%
group_by(doc_id) %>%
summarize(total = sum(n))
# and recombine a new dataset of counts
plto <- left_join(diago, tbo)
## redraw a plot of the inverse frequency
rfrqo <- plto %>%
group_by(doc_id) %>%
mutate(rank = row_number(), `term frequency` = n/total) %>%
ungroup()
frwo <- rfrqo %>%
filter(rank < 500,
rank > 10)
lm(log10(`term frequency`) ~ log10(rank), data = frwo)
##
## Call:
## lm(formula = log10(`term frequency`) ~ log10(rank), data = frwo)
##
## Coefficients:
## (Intercept) log10(rank)
## -1.89357 -0.09732
rfrqo %>%
ggplot(aes(rank, `term frequency`, color = doc_id)) +
geom_line(size = .5, alpha = .25, show.legend = F) +
scale_x_log10() +
scale_y_log10() +
scale_color_viridis_d() +
geom_abline(intercept = -1.8936, slope = -0.0973, color = "darkslategrey")
# even when obsolete words are removed, our results deviate from Zipf's law
Inverse document frequency: the natural logarithm of the fraction of the total number of documents to the number of documents containing the word
The inverse document frequency decreases the weight of commonly used words. The product of term frequency and inverse document frequency is the tf-idf statistic, which estimates how important a word is to a text within a corpus of texts. The higher the statistic, the more relevant the word, or any token, to the document. It is a method of assigning values to the tokens of the texts so that an algorithm can analyze them. Tasks like information retrieval (as in a web crawler) or keyword extraction are enabled by this process. In practice, you could use the approach to identify the theme of a text or to construct a dictionary to identify texts with a similar vocabulary.
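To make the formula concrete, here is a hand-computed tf-idf on a toy three-document “corpus” in base R (the words and documents are invented; bind_tf_idf() from {tidytext} automates this):

```r
# tf-idf for the word "left" in each of three toy documents
docs <- list(d1 = c("mass", "left", "breast"),
             d2 = c("cyst", "left", "breast"),
             d3 = c("breast", "normal"))
N   <- length(docs)                                            # total documents
tf  <- sapply(docs, function(d) sum(d == "left") / length(d))  # term frequency
df  <- sum(sapply(docs, function(d) "left" %in% d))            # documents containing the word
idf <- log(N / df)                                             # natural log, as in the definition
tf * idf                                                       # tf-idf per document
```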
# Try to find the most important words in the corpus
## bind_tf_idf()
diago %>%
bind_tf_idf(word, doc_id, n) %>%
arrange(desc(tf_idf))
# A barplot for text No 64 (for illustrative purposes)
diago %>%
bind_tf_idf(word, doc_id, n) %>%
arrange(desc(tf_idf)) %>%
filter(doc_id == "m64.txt") %>%
ungroup() %>%
ggplot(aes(tf_idf, fct_reorder(word, tf_idf))) +
geom_col(show.legend = F)
Comparing these charts, one can make inferences about the content of the texts; on occasion we may even recognize the diagnosis. A dictionary may be created, using the tokens with the highest tf-idf, with few mistakes in the process:
diago %>%
bind_tf_idf(word, doc_id, n) %>%
slice_max(tf_idf, n = 150) %>% # get the 150 highest tf-idf terms
ggplot(aes(tf_idf, fct_reorder(word, tf_idf))) +
geom_col()
The majority of words are contextual; the exceptions mainly concern numbers. One may be tempted to remove them from the text, but bear in mind that our corpus includes diagnostic scales and measurements, which could be useful in further analysis. Also keep in mind that this is a corpus of free text, like most medical records: a small letter o may stand for a zero in the diagnostic scale.
So far we have examined our corpus word by word to try to formulate an idea of the theme and the most commonly used words. However, the meaning in natural languages is rarely held in a single word. Phrases are important pieces of speech that carry information based on the combination of the words they are composed of. An example of this paradigm is negation: the syntax that reverses the meaning of the same words through a negative word or affix.
The same tokenization process can be implemented to split the text into n-grams.
n-grams: a sequence of n words, as they appear in data
Any number of consecutive words may be applicable depending on the context and the individual linguistic characteristics of different scientific (or other) fields.
# revisit the original dataset
# this time tokenize into four-word phrases
dgrl <- dataC %>%
unnest_tokens(tetragram, text, token = "ngrams", n = 4)
# count, once more
dgrl %>%
count(tetragram, sort = T)
# and calculate the tf-idf
dgrl %>%
count(doc_id, tetragram) %>%
bind_tf_idf(tetragram, doc_id, n) %>%
arrange(desc(tf_idf))
## and plot the results, without stop words
dgrl %>%
separate(tetragram, c("w1", "w2", "w3", "w4"), sep = " ") %>%
filter(!w1 %in% cstpwrds$word, # remove all obsolete words
!w2 %in% cstpwrds$word,
!w3 %in% cstpwrds$word,
!w4 %in% cstpwrds$word) %>%
count(w1, w2, w3, w4) %>%
filter(n > 5) %>%
igraph::graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
arrow = a, end_cap = circle(.05, 'inches')) +
geom_node_point(colour = "steelblue") +
geom_node_text(aes(label = name), vjust = 1, hjust = 1)
The visualization above can be read as a Markov chain, a model in text processing in which every word depends only on the previous word.
Markov chain: a stochastic process that models a finite set of states, with fixed conditional probabilities of jumping from a given state to another
A Markov chain transitions from one defined state (here, a word) to another according to probabilistic rules. In a Markov chain it does not matter how the process arrived at the current state (the previous words have no weight); only the current state and the time elapsed matter. A simple example is computing the probability of blindly drawing a certain colour of ball from a bag of coloured balls, with the ball replaced after every draw.
Based on this model, we can extract information on how words are used in the corpus and perform tasks such as text simulation.
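As an illustration of text simulation with such a model, the sketch below draws each next word in proportion to bigram counts; the words and counts are invented for the example, not taken from the corpus:

```r
# Simulate text from a toy table of bigram counts (base R)
set.seed(2021)
bigrams <- data.frame(w1 = c("no", "no", "mass", "mass"),
                      w2 = c("mass", "lesion", "seen", "detected"),
                      n  = c(3, 1, 2, 2))
simulate <- function(start, steps) {
  out <- start
  for (i in seq_len(steps)) {
    nxt <- bigrams[bigrams$w1 == out[length(out)], ]
    if (nrow(nxt) == 0) break                       # no known continuation
    out <- c(out, sample(nxt$w2, 1, prob = nxt$n))  # P(next | current) from counts
  }
  paste(out, collapse = " ")
}
simulate("no", 2)
```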
Correlating n-grams provides the basis to explore words that tend to co-occur in text even when they do not occur next to each other.
In the following example we are looking for all the combinations of the term “acr” in the corpus.
dataC %>%
unnest_tokens(word, text) %>%
filter(!word %in% cstpwrds$word) %>% # filter stop words
pairwise_count(word, doc_id, sort = TRUE) %>%
filter(item1 == "acr")
## Warning: `distinct_()` was deprecated in dplyr 0.7.0.
## Please use `distinct()` instead.
## See vignette('programming') for more help
Had we stopped the code at the pairwise_count() step, we would have the entire matrix of word combinations found in the corpus. This allows us to calculate correlations between terms, an indication of how often words appear together relative to how often they appear separately. This metric is called the phi coefficient (binary correlation).
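For concreteness, the phi coefficient can be computed by hand from a 2x2 co-occurrence table; the counts below are invented, and pairwise_cor() performs this calculation for every pair of words:

```r
# phi = (n11*n00 - n10*n01) / sqrt(n1. * n0. * n.1 * n.0)
n11 <- 30; n10 <- 10; n01 <- 5; n00 <- 26   # toy document counts
n1_ <- n11 + n10; n0_ <- n01 + n00          # marginals for word 1 (present/absent)
n_1 <- n11 + n01; n_0 <- n10 + n00          # marginals for word 2 (present/absent)
phi <- (n11 * n00 - n10 * n01) / sqrt(n1_ * n0_ * n_1 * n_0)
phi
```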
dataC %>%
unnest_tokens(word, text) %>%
filter(!word %in% cstpwrds$word) %>% # filter stop words
group_by(word) %>%
pairwise_cor(word, doc_id, sort = TRUE) %>%
filter(item1 == "acr")
The correlations may help us investigate words with special interest and their most common associates:
dataC %>%
unnest_tokens(word, text) %>%
filter(!word %in% cstpwrds$word) %>% # filter stop words
group_by(word) %>%
pairwise_cor(word, doc_id, sort = TRUE) %>%
filter(item1 %in% c("acr", "μαστού")) %>%
group_by(item1) %>%
slice_max(correlation, n = 8) %>%
ungroup() %>%
ggplot(aes(item2, correlation)) +
geom_bar(stat = "identity") +
facet_wrap(~ item1, scales = "free_y") +
coord_flip()